Most Significant Substring Mining Based on Chi-square Measure
نویسندگان
چکیده
Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data has emerged as a requirement for diverse applications. Given a string, the problem then is to identify the substrings that differs the most from the expected or normal behavior, i.e., the substrings that are statistically significant (i.e., less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in the runtime, while maintaining a high approximation ratio of more than 0.96.
منابع مشابه
Mining Statistically Significant Substrings Based on the Chi-Square Measure
Given the vast reservoirs of data stored worldwide, efficient mining of data from a large information store has emerged as a great challenge. Many databases like that of intrusion detection systems, web-click records, player statistics, texts, proteins etc., store strings or sequences. Searching for an unusual pattern within such long strings of data has emerged as a requirement for diverse app...
متن کاملMining Statistically Significant Patterns using the Chi-Square Statistic
Statistical significance is used to ascertain whether the outcome of a given experiment can be ascribed to some extraneous factors or is solely due to chance. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to randomness or chance alone. In the thesis, we study the problem of identifying the statistically relevant patterns in string...
متن کاملMining Statistically Significant Substrings using the Chi-Square Statistic
The problem of identification of statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to ra...
متن کاملAnalysis and summarization of correlations in data cubes
This paper presents a novel mechanism to analyze and summarize the statistical correlations among the attributes of a data cube. To perform the analysis and summarization, this paper proposes a new measure of statistical significance. The main reason for proposing the new measure of statistical significance is to have an essential closure property, which is exploited in the summarization stage ...
متن کاملA Comparative Study of Hard and Fuzzy Data Clustering Algorithms with Cluster Validity Indices
Data clustering is one of the important data mining methods. It is a process of finding classes of a data set with most similarity in the same class and most dissimilarity between different classes. The well known hard clustering algorithm (K -means) and Fuzzy clustering algorithm (FCM) are mostly based on Euclidean distance measure. In this paper, a comparative study of these algorithms with d...
متن کامل